
    Statistical Tests for Detecting Differential RNA-Transcript Expression from Read Counts

    As a fruit of the current revolution in sequencing technology, transcriptomes can now be analyzed at an unprecedented level of detail. These advances have been exploited for detecting differentially expressed genes across biological samples and for quantifying the abundances of various RNA transcripts within one gene. However, explicit strategies for detecting the hidden differential abundances of RNA transcripts in biological samples have not been defined. In this work, we present two novel statistical tests to address this issue: a 'gene structure sensitive' Poisson test for detecting differential expression when the transcript structure of the gene is known, and a kernel-based test called Maximum Mean Discrepancy (MMD) when it is unknown. We analyzed the proposed approaches on simulated read data for two artificial samples as well as on real reads generated by the Illumina Genome Analyzer for two _C. elegans_ samples. Our analysis shows that the Poisson test identifies genes with differential transcript expression considerably better than previously proposed RNA transcript quantification approaches for this task. The MMD test is able to detect a large fraction (75%) of such differential cases without knowledge of the annotated transcripts. It is therefore well suited to analyzing RNA-Seq experiments when genome annotations are incomplete or unavailable, where other approaches must fail.
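To make the kernel-based idea concrete, the following is a minimal sketch of a two-sample Maximum Mean Discrepancy test with a permutation-based p-value. It is not the authors' implementation: the Gaussian kernel, the `bandwidth` value, the permutation scheme, and the representation of each observation as a plain numeric feature vector are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import math
import random


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two equal-length feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of the squared MMD between two samples.

    xs and ys are lists of feature vectors (assumed representation).
    The estimate is zero when both samples are identical and grows
    as their kernel mean embeddings move apart.
    """
    def mean_kernel(a, b):
        total = sum(gaussian_kernel(u, v, bandwidth) for u in a for v in b)
        return total / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


def permutation_p_value(xs, ys, n_permutations=200, bandwidth=1.0, seed=0):
    """Estimate a p-value by reshuffling the pooled observations.

    Under the null hypothesis both samples come from one distribution,
    so group labels are exchangeable. The +1 terms keep the estimate
    strictly positive.
    """
    rng = random.Random(seed)
    observed = mmd_squared(xs, ys, bandwidth)
    pooled = list(xs) + list(ys)
    exceed = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        stat = mmd_squared(pooled[:len(xs)], pooled[len(xs):], bandwidth)
        if stat >= observed:
            exceed += 1
    return (exceed + 1) / (n_permutations + 1)
```

Calling `permutation_p_value(sample_a, sample_b)` on two lists of feature vectors returns a value in (0, 1]; small values suggest the two samples differ in distribution. How reads are turned into feature vectors is left open here because the abstract does not specify it.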

    Detection and characterisation of RNA processing variation from deep RNA sequencing data

    The introduction of high-throughput sequencing technologies has opened unprecedented opportunities for research on the regulation of ribonucleic acid (RNA) processing, which is central to cellular information processing. By enabling accurate and extensive measurements of various properties of cellular RNAs, these techniques allow the transcriptome and its regulation to be investigated systematically on a genome-wide scale. The development of computational methods to analyse the resulting data, however, is still lagging behind the advances in experimental data generation. In this thesis, we present novel approaches to leverage the potential of high-throughput sequencing technologies for studying the regulation of RNA processing. More specifically, we focused on the following three research problems. First, we investigated how best to extract information from RNA-sequencing (RNA-Seq) data and how to design RNA-Seq experiments in order to maximise their utility for answering the investigated question. For this purpose, we derived a probabilistic model to estimate the utility of RNA-Seq experiments as a function of the experimental parameters for typical analyses such as the identification of transcripts and the detection of differential splicing. Application of our models provided fundamental, experimentally supported insights into how particular experimental parameters influence the amount of information gained from an RNA-Seq experiment. Based on these insights, we suggest strategies for an improved experimental design of transcriptome analysis experiments. The second investigated aspect was the detection of differential RNA processing based on high-throughput sequencing data. 
Here, we proposed novel statistical tests to detect changes in RNA processing for two distinct settings: when the gene annotation is complete (which is often the case for model organisms) and when the gene annotation is incomplete or unknown (as is the case for non-model organisms or pathological phenotypes). We showed that, on both realistically simulated and experimental data, our newly developed tests outperformed state-of-the-art methods. Furthermore, we showed how our methods could be extended to detect differential RNA secondary structure and to associate changes in RNA processing with genetic variation. Finally, we successfully applied our methods to investigate the role of splicing in human cancer cells, to understand mechanisms of nonsense-mediated decay in A. thaliana and to reveal regulatory structural motifs of translation in humans. The third investigated aspect was the characterisation of changes in RNA processing. We showed that combining RNA-Seq data with information on genomic variation and transcription factor binding preferences explained causes of gene expression variation. For this, we first performed a comprehensive analysis of the gene expression landscape in an A. thaliana population. Furthermore, we showed that there is a significant enrichment of genetic variants associated with gene expression in predicted transcription factor binding sites. Finally, we showed that alterations of transcription factor binding sites are a major driver of gene expression variation. Overall, we addressed different aspects of the detection and characterisation of RNA processing. Using our new methods, we have gained novel insights into the regulation of RNA processing. However, the work has also shown that there are still several open questions, which should be addressed in future studies.
    The regulation of ribonucleic acid (RNA) processing is of central importance for cellular information processing. The introduction of high-throughput sequencing (HTS) technologies has opened new opportunities for further research in this field. Because these techniques permit extensive and accurate measurements of various properties of cellular RNAs, they enable genome-wide, systematic investigation of the transcriptome and its regulation. The development of methods for analysing the resulting data, however, has not progressed as far as experimental data generation. In our work, we present new approaches to exploit the potential of HTS for studying the regulation of RNA processing. We concentrated on the following three aspects. First, how information can be extracted from RNA-sequencing (RNA-Seq) data and how RNA-Seq experiments must be designed to yield maximal benefit. To this end, we derived probabilistic models that, depending on the parameters of the respective experiment, determine the utility of RNA-Seq experiments for common analyses such as the identification of transcripts and the detection of differential splicing. Applying our models yields fundamental insights, confirmed by experimental data, into how the experimental parameters influence the information gained from RNA-Seq experiments. Based on these findings, we propose improved experimental designs for transcriptome analysis experiments. The second aspect was the detection of changes in RNA processing using HTS data. Here we present novel statistical tests to detect changes in RNA processing in two different settings: (a) when the gene annotation is complete, which is often true for model organisms, and (b) when the gene annotation is incomplete or unknown. The latter is frequently the case for non-model organisms or pathological phenotypes. In this work we were able to show that our newly developed tests were superior to other modern methods, both on realistically simulated and on experimental data. In addition, we showed how our methods can be extended to detect differences in RNA secondary structures and to associate differential RNA processing with genetic variation. Finally, we showed how our methods can be applied, first, to investigate the role of splicing in human cancer cells; second, to understand the mechanisms underlying nonsense-mediated decay; and third, to discover regulatory structural motifs of translation in humans. The last aspect was the characterisation of changes in RNA processing. We showed that using RNA-Seq data together with information on genomic variation and transcription factor (TF) binding preferences makes it possible to better understand the mechanism of changes in gene expression. To this end, we first carried out a comprehensive analysis of gene expression in an A. thaliana population. We also demonstrated a significant enrichment of genetic variants associated with gene expression in predicted TF binding sites (TFBS). Lastly, we showed that alterations of TFBS in promoters were an important cause of gene expression variation. In summary, we investigated different aspects of the detection and characterisation of RNA processing. With the help of our newly developed methods, we gained new insights into the regulation of RNA processing. Our work showed, however, that many open questions remain, which should be addressed in future investigations.

    Deep learning for prediction of population health costs

    Accurate prediction of healthcare costs is important for optimally managing health costs. However, methods that leverage the medical richness of data such as health insurance claims or electronic health records are missing. Here, we developed a deep neural network to predict future costs from health insurance claims records. We applied the deep network and a ridge regression model to a sample of 1.4 million insured individuals in Germany to predict total one-year health care costs. Both methods were compared to Morbi-RSA models using various performance measures and were also used to predict patients with a change in costs and to identify relevant codes for this prediction. We showed that the neural network outperformed the ridge regression as well as all Morbi-RSA models for cost prediction. Further, the neural network was superior to ridge regression in predicting patients with a cost change and identified more specific codes. In summary, we showed that our deep neural network can leverage the full complexity of the patient records and outperforms standard approaches. We suggest that the better performance is due to the ability to incorporate complex interactions in the model and that the model might also be used for predicting other health phenotypes.

    Oqtans: a Galaxy-integrated workflow for quantitative transcriptome analysis from NGS Data : From Seventh International Society for Computational Biology (ISCB) Student Council Symposium 2011 Vienna, Austria. 15 July 2011

    First published by BioMed Central: Schultheiss, Sebastian J.; Jean, Géraldine; Behr, Jonas; Bohnert, Regina; Drewe, Philipp; Görnitz, Nico; Kahles, André; Mudrakarta, Pramod; Sreedharan, Vipin T.; Zeller, Georg; Rätsch, Gunnar: Oqtans: a Galaxy-integrated workflow for quantitative transcriptome analysis from NGS Data - In: BMC Bioinformatics. - ISSN 1471-2105 (online). - 12 (2011), suppl. 11, art. A7. - doi:10.1186/1471-2105-12-S11-A7

    DNA methylation variation in Arabidopsis has a genetic basis and shows evidence of local adaptation

    Epigenome modulation in response to the environment potentially provides a mechanism for organisms to adapt, both within and between generations. However, neither the extent to which this occurs, nor the molecular mechanisms involved are known. Here we investigate DNA methylation variation in Swedish Arabidopsis thaliana accessions grown at two different temperatures. Environmental effects on DNA methylation were limited to transposons, where CHH methylation was found to increase with temperature. Genome-wide association mapping revealed that the extensive CHH methylation variation was strongly associated with genetic variants in both cis and trans, including a major trans-association close to the DNA methyltransferase CMT2. Unlike CHH methylation, CpG gene body methylation (GBM) on the coding region of genes was not affected by growth temperature, but was instead strongly correlated with the latitude of origin. Accessions from colder regions had higher levels of GBM for a significant fraction of the genome, and this was correlated with elevated transcription levels for the genes affected. Genome-wide association mapping revealed that this effect was largely due to trans-acting loci, a significant fraction of which showed evidence of local adaptation. These findings constitute the first direct link between DNA methylation and adaptation to the environment, and provide a basis for further dissecting how environmentally driven and genetically determined epigenetic variation interact and influence organismal fitness. Comment: 38 pages, 4 figures

    A Comparison of Nine Machine Learning Mutagenicity Models and Their Application for Predicting Pyrrolizidine Alkaloids

    Random forest, support vector machine, logistic regression, neural network and k-nearest neighbor (lazar) algorithms were applied to a new Salmonella mutagenicity dataset with 8,290 unique chemical structures, utilizing MolPrint2D and Chemistry Development Kit (CDK) descriptors. Cross-validation accuracies of all investigated models ranged from 80 to 85%, which is comparable with the interlaboratory variability of the Salmonella mutagenicity assay. Pyrrolizidine alkaloid predictions showed a clear distinction between chemical groups, where otonecines had the highest proportion of positive mutagenicity predictions and monoesters the lowest.

    Modeling Structure-Activity Relationship of AMPK Activation.

    The adenosine monophosphate-activated protein kinase (AMPK) is critical in the regulation of important cellular functions such as lipid, glucose, and protein metabolism; mitochondrial biogenesis and autophagy; and cellular growth. In many diseases, such as metabolic syndrome, obesity, diabetes, and also cancer, activation of AMPK is beneficial. Therefore, there is growing interest in AMPK activators that act either by direct action on the enzyme itself or by indirect activation of upstream regulators. Many natural compounds have been described that activate AMPK indirectly. These compounds are usually contained in mixtures with a variety of structurally different other compounds, which in turn can also alter the activity of AMPK via one or more pathways. For these compounds, experiments are complicated, since the required pure substances are often not yet isolated and/or therefore not sufficiently available. Therefore, our goal was to develop a screening tool that could handle the profound heterogeneity in activation pathways of AMPK. Since machine learning algorithms can model complex (unknown) relationships and patterns, some of these methods (random forest, support vector machines, stochastic gradient boosting, logistic regression, and deep neural network) were applied and validated using a database comprising 904 activating and 799 neutral or inhibiting compounds identified through an extensive PubMed literature search and the PubChem Bioassay database. All models showed unexpectedly high classification accuracy in training and, more importantly, in predicting the unseen test data. These models are therefore suitable tools for rapid in silico screening of established substances or multicomponent mixtures and can be used to identify compounds of interest for further testing.

    Pharmacokinetics of Transdermal Etofenamate and Diclofenac in Healthy Volunteers

    Little is known about the course of the plasma concentration and the bioavailability of non-steroidal anti-inflammatory drugs (NSAIDs) contained in dermal patches. We compared an etofenamate prototype patch (patent EP 1833471) and a commercially available diclofenac epolamine patch with respect to the bioavailability of the active ingredients relative to the respective i.m. applications and with respect to their plasma concentration-time course. Twenty-four healthy human volunteers were treated using a parallel group design (n = 12 per group) with a single dermal patch (removed after 12 hr) followed (after a latency of 48 hr) by eight consecutive dermal patches every 12 hr to reach steady-state conditions. The patches were generally well tolerated, but one volunteer treated with etofenamate developed an allergic contact dermatitis. After the first patch, Cmax was 0.81 ± 0.11 (mean ± S.E.M.) ng/mL (reached 12 hr after patch removal) for diclofenac and 31.3 ± 3.8 ng/mL (reached at patch removal) for flufenamic acid, the main metabolite of etofenamate. Etofenamate was not detectable. After repetitive dosing, trough plasma concentrations after the eighth dose were 1.72 ± 0.32 ng/mL for diclofenac and 48.7 ± 6.6 ng/mL for flufenamic acid. Bioavailabilities (single dose) relative to i.m. applications were 0.22 ± 0.04% for diclofenac and 1.15 ± 0.06% for flufenamic acid. In conclusion, the relative bioavailability (compared to the respective i.m. application) of both drugs is low. The maximal plasma concentrations after topical administration of these drugs are well below the IC50 values for COX-1 and COX-2, explaining the absence of dose-dependent toxicities.
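For readers unfamiliar with how a relative bioavailability figure is typically derived, the sketch below uses the common dose-normalized ratio of areas under the plasma concentration-time curve (AUC), with the AUC computed by the linear trapezoidal rule. The abstract does not state how the study computed its values, so this definition, the function names, and any numbers passed in are assumptions for illustration only and are not the study's data.

```python
def auc_trapezoid(times, concentrations):
    """Area under a concentration-time curve by the linear trapezoidal rule.

    times and concentrations are equal-length sequences, with times
    in ascending order.
    """
    if len(times) != len(concentrations) or len(times) < 2:
        raise ValueError("need at least two matching time/concentration points")
    area = 0.0
    for i in range(1, len(times)):
        width = times[i] - times[i - 1]
        area += width * (concentrations[i] + concentrations[i - 1]) / 2.0
    return area


def relative_bioavailability_percent(auc_test, dose_test, auc_ref, dose_ref):
    """Dose-normalized AUC of the test route relative to the reference route.

    Returns a percentage; 100 means the test route delivers as much drug
    per unit dose to the plasma as the reference route.
    """
    return 100.0 * (auc_test / dose_test) / (auc_ref / dose_ref)
```

In this setting the test route would be the dermal patch and the reference route the i.m. application; both AUC values must be computed over comparable sampling windows for the ratio to be meaningful.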